Redundancy lets a system perform its intended functions
despite some number of faults

Fault-tolerant computing can increase
the dependability of a computer system
by providing more hardware, software,
or information than is necessary. This
redundancy lets a system perform its
intended functions despite some number
of faults.

Quantitatively, you can measure how 
dependable a system is in terms of either 
reliability or availability. System reli- 
ability is the probability that the system 
won't fail by a given time. If a system 
needs continuous error-free operation, a 
minimum reliability level must be main- 
tained over the system's useful lifetime. 

A system's maximum useful lifetime 
is the length of time in which its reliabil- 
ity remains greater than some specified 
minimum value. Fault-tolerant comput- 
ing is one method for increasing a sys- 
tem's useful lifetime. 

It's important to note two things, how- 
ever. First, for a given application, a sys- 
tem could be sufficiently reliable without 
fault tolerance. And second, using fault 
tolerance doesn't necessarily guarantee 
that a system will be sufficiently reliable 
for a particular application. 

If occasional, brief periods of down- 
time are acceptable in an application, an 
availability goal may be more appropri- 
ate. Availability is the probability that a 
system will be operational at any given 
moment; thus, it is the ratio of the sys- 
tem's uptime to the sum of its uptime and 
downtime. Availability is increased by 
using fault-tolerant designs to maximize 
uptime or minimize downtime. 
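
The availability ratio described above can be sketched in a few lines. The function name and the sample figures are illustrative assumptions, not measurements from any real system.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Ratio of uptime to the sum of uptime and downtime."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical year of operation: 8,760 hours with 8.76 hours of downtime.
print(f"{availability(8760 - 8.76, 8.76):.4f}")
```

Either raising the numerator (more uptime) or shrinking the downtime term moves the ratio toward 1, which is why both goals appear in fault-tolerant designs.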

With fault-tolerant computing, you can
increase your system's dependability by
selectively providing more hardware,
software, or information than you need.
You can also increase your system's
useful lifetime. Fault tolerance means
being able to detect, mask, and confine
errors; diagnose faults; and repair,
reconfigure, and recover your system.
What is the key? Redundancy.

Faults and Errors

The words fault and error sound like
synonyms; they're not, however, and the
distinction between them is important. A
fault is a physical condition that occurs
in a hardware or software element,
making the element unable to perform its
intended function. An error, on the other
hand, is a symptom of a fault and
manifests itself as an incorrect output
or invalid state for the faulty element.

A fault is referred to as latent when it 
occurs without producing errors during 
system operation. A common example 
would be a fault that alters the contents of 
a byte in memory. If the byte is not ac- 
cessed after this change, no error occurs. 

You can characterize faults by dura- 
tion and extent. Fault duration may be 
permanent, transient, or intermittent. A 
permanent fault doesn't disappear once it 
occurs. It results from failures of elec- 
tronic components or interconnections, 
physical damage, or design errors. De- 
sign errors are especially difficult to de- 
tect, since the affected hardware or soft- 
ware often performs as designed. 

Transient faults are temporary condi- 
tions, usually the result of electromag- 
netic interference, temperature, humid- 
ity, incorrect operating voltage, or other 
external disturbances. Transient faults 
typically disappear as soon as the exter- 
nal condition is eliminated. 

An intermittent fault alternates be- 
tween active and dormant states and is 
usually caused by poor design, border- 
line operating conditions, or the margin- 
al operation of a component prior to fail- 
ure. For both transient and intermittent 
faults, errors may remain after the fault 
disappears. 

Many systems use diagnostic programs to
locate faults. However, these
diagnostics are only effective for
permanent faults. Systems need to use
other methods to locate transient and
intermittent faults, which are far more
likely to occur in real systems.

The extent of a fault indicates how 
much of the system it affects. A local 
fault directly affects a single component, 
while a global fault influences multiple 
components. Most fault-tolerance strate- 
gies deal with a limited number of local- 
ized faults; they tend to leave systems 
vulnerable to global faults. 

Transient faults associated with exter- 
nal disturbances tend to be global in na- 
ture, since the entire system is typically 
exposed to the same condition. In con- 
trast, the failure of a transistor junction 
would directly affect only the component 
that contains it. 

Campaign Strategies 

The ability to tolerate faults requires a 
design strategy that includes one or more 
of the following elements: error detec- 
tion, error masking, error confinement, 
fault diagnosis, system repair and recon- 
figuration, and system recovery. 

To detect errors, you need to have 
enough redundancy that the system can 
distinguish between correct and incor- 
rect information. You can create redun- 
dant information by replicating the mod- 
ules that produce the information, by 
encoding the information so that errors 
result in detectable noncode words, or by 
using heuristics to determine whether the 
information is valid or reasonable (e.g., 
a square-root algorithm that produces a 
negative result would be faulty). 
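
A reasonableness check of the kind mentioned above might look like the following sketch. The wrapper name `checked_sqrt`, the tolerance, and the injected `sqrt_impl` parameter are assumptions made for illustration.

```python
import math


def checked_sqrt(x: float, sqrt_impl=math.sqrt, tolerance: float = 1e-9) -> float:
    """Run a square-root routine and reject results that cannot be valid."""
    result = sqrt_impl(x)
    # Heuristic 1: a real square root is never negative.
    # Heuristic 2: squaring the result should reproduce the input.
    if result < 0 or not math.isclose(result * result, x,
                                      rel_tol=tolerance, abs_tol=tolerance):
        raise ArithmeticError(f"implausible square root {result!r} for {x!r}")
    return result


print(checked_sqrt(9.0))
```

The check costs far less than a second copy of the routine, but it only catches errors that violate the chosen heuristics.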

For continuous error-free operation, 
the system must dynamically correct or 
mask errors. Error masking requires 
more redundancy than error detection 
does, because the system must extract the 
correct information from the informa- 
tion that the redundant configuration 
produced. Error masking also typically 
uses duplicate modules or extra bits to 
encode information. 

You can often mask the effects of tran- 
sient faults simply by retrying the opera- 
tion that failed. The bus interfaces of sev- 
eral current microprocessors let exter- 
nally detected errors initiate bus-cycle 
retries transparent to the software. 
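
The same retry idea can be sketched in software. This is only an analogy to the hardware bus-cycle retry; the `FlakyBus` class, the `retry` helper, and the choice of `IOError` as the transient error are all hypothetical.

```python
def retry(operation, attempts: int = 3, transient=(IOError,)):
    """Call operation() again if it fails with a transient error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == attempts:
                raise  # the fault did not clear; report it to the caller


class FlakyBus:
    """Stand-in for a device whose first few reads hit a transient fault."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def read(self) -> int:
        self.calls += 1
        if self.calls <= self.failures:
            raise IOError("transient bus error")
        return 0x5A


bus = FlakyBus(failures=2)
print(hex(retry(bus.read)), "after", bus.calls, "attempts")
```

A bounded attempt count matters: a permanent fault never clears, so unlimited retries would hang the caller.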

To minimize the impact of a fault, you
must establish error-containment
boundaries to confine errors to their
originating modules. You don't want them
to propagate through the rest of the
system. Error-containment boundaries
prevent errors from spreading into or
out of a module by checking all its
inputs and outputs, respectively, and
then isolating the module from the rest
of the system if an error is found.

In systems that have enough resources 
to continue operating without one or 
more modules, you must not let a failed 
module affect the remaining resources. 
Whether or not the system can continue, 
limiting error propagation minimizes the 
amount of time needed to repair any 
damage. 

To repair the system, you must first
analyze the errors to identify which
components are faulty. How detailed a
diagnosis you need depends on your
system repair and reconfiguration
strategy. If you plan to replace faulty
modules, whether automatically or
manually, you need only identify which
module is faulty.

You don't gain any benefit by analyzing
faults further unless you plan to repair
the faulty modules themselves. For
example, the diagnostic programs of the
Bell System's 1A Processor, which was
the heart of the company's first
electronic-switching systems, focused on
isolating a problem only to the three
replaceable modules it might occur in.
This broad-brush approach minimized
repair time and increased system
availability.

In a fault-tolerant system, you must 
either replace a faulty component or 
route information around it to keep it 
from interfering with how the rest of the 
system operates. In most commercial and 
industrial applications, repair is manual; 
circuit boards are replaced by hand. 

Some commercially available
fault-tolerant systems incorporate a
"hot repair" capability. This lets you
de-energize a faulty board and remove it
from the system, install and energize a
spare board, and integrate the new board
into the system, all without bringing
the system down. The rest of the system
can continue operations during this
process.

Where it's not practical to repair a
system manually, such as in space
vehicles or aircraft during flight
operations, reconfiguration must occur
automatically. The system must be able
to isolate the faulty module by
switching off its power or otherwise
segregating its outputs from the rest of
the system. Once the system isolates the
faulty module, it can switch on a spare
module to replace the faulty one, or it
can transfer the module's tasks to
another operational unit.

If errors have propagated in a system, 
if new hardware is introduced, or if work 
has been transferred between modules, 
you may have to restore the system's state 
or set it to some acceptable value before 
operations can continue. 

System recovery can be either forward
or backward. To implement backward
recovery, the system saves its state at
various checkpoints. After repair or
reconfiguration, the system's state is
restored to that of the last good
checkpoint, and all processing is
repeated from that point. It's important
in backward recovery to identify those
operations that can't be repeated, such
as posting a deposit to a customer's
bank account.
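
Backward recovery can be sketched as a state object that saves and restores snapshots. The class and method names below are assumptions for illustration; a real system would also have to track which operations must not be repeated after a rollback.

```python
import copy


class CheckpointedState:
    """Holds working state plus a copy taken at the last good checkpoint."""

    def __init__(self, state: dict):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self) -> None:
        # Deep copy so later changes to the working state cannot
        # leak into the saved checkpoint.
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._saved)


job = CheckpointedState({"records_processed": 0})
job.state["records_processed"] = 100
job.checkpoint()
job.state["records_processed"] = 137   # work done before an error is detected
job.rollback()
print(job.state)                       # processing resumes from record 100
```

After the rollback, records 101 through 137 are processed again, which is harmless only if reprocessing them has no outside effect.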

When errors have not been signifi- 
cantly propagated through the system, 
you can implement forward recovery by 
masking errors or otherwise deriving a 
correct system state following the occur- 
rence of a fault. System operation can 
then simply continue without having to 
roll back to an earlier state. You would 
initialize any new hardware introduced 
during repair or reconfiguration to the 
current state of the system prior to con- 
tinuing. 

But if error propagation has been more 
significant, further recovery actions 
may be necessary. You may have to undo 
an interrupted database update to put the 
database in a consistent state. Or you 
may have to reacquire an object that a 
radar system was tracking. To minimize 
recovery time, it is critical that you en- 
force error-containment boundaries. 

Multiple Modules 

Active/backup module pairs. The most
common form of modular redundancy is to
simply replace a faulty module with a
spare (see figure 1a). To do this, the
active module must include internal
error-detection mechanisms, and the
system must properly transfer program
control to the backup module.

The backup unit can be hot or cold. A 
hot spare performs all computations in 
parallel with the active unit, and thus it 
always contains the correct state of the 
system, making the switchover instanta- 
neous. A cold spare can be either un- 
powered or used for other work until the 
active module fails. 

Tandem Computers' (Cupertino, CA)
NonStop systems use active/backup
process pairs. The backup process
remains dormant, letting the computer
perform other tasks. The active module
sends state information to the backup at
various process checkpoints, so the
backup can initiate execution from the
last checkpoint if the active module
fails. A cold spare, which is used for
other work, increases overall
production, but at the expense of a
longer transition time for the backup to
replace the failed unit.

Duplex operations. You can often detect
errors more completely by using
identical modules in a duplex
configuration (see figure 1b). In this
approach, two modules perform all
operations in lock-step fashion, and
they use a comparator to detect any
mismatch between the two sets of
outputs.

Figure 1: Modular redundancy approaches.
(a) The active/backup module pair simply
replaces a faulty module with a spare.
(b) Duplex operations use two identical
modules performing all operations and a
comparator to detect any mismatch in
outputs. (c) Duplex operations with
on-chip comparators provide comparators
at each output pin and a master/checker
operating mode. (d) In the
triple-modular-redundancy (TMR)
configuration, three identical modules
perform each operation concurrently, and
a voter determines a majority ruling,
thus masking failures in any one module.

As soon as the system detects an error, 
it disables the outputs of the module and 
issues an error signal. You can then dis- 
card the entire duplex module and reas- 
sign its tasks, or you can run additional 
diagnostics to determine which of the 
two units in the module is faulty so that 
the good unit can continue on its own. 
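
A software analogy of the duplex comparator is sketched below. The function and exception names are hypothetical, and the "stuck bit" unit is an invented fault used only to exercise the check.

```python
class DuplexMismatch(Exception):
    """Raised when the two units of a duplex module disagree."""


def run_duplex(unit_a, unit_b, *inputs):
    """Run both units on the same inputs and compare their outputs."""
    out_a = unit_a(*inputs)
    out_b = unit_b(*inputs)
    if out_a != out_b:
        # The comparator cannot tell which unit is wrong, only that
        # they disagree, so the output is withheld and an error raised.
        raise DuplexMismatch(f"outputs differ: {out_a!r} vs {out_b!r}")
    return out_a


def good_adder(a, b):
    return a + b


def stuck_bit_adder(a, b):
    return (a + b) | 0x01   # hypothetical fault: low-order bit stuck at 1


print(run_duplex(good_adder, good_adder, 2, 2))
```

Note that when the correct sum is already odd, the stuck bit produces no mismatch; the fault stays latent until an input exposes it.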

Bell's 1A Processor used two identical
processors that brought 12 internal
points out to comparators, two points
during each clock cycle. If an error was
detected, the diagnostics selected one
of the two processors to continue
operations until repairs could be made.

Duplex operations with on-chip
comparators. Several recent VLSI
devices, including the Intel iAPX 432
microprocessor family and the AMD 29000
RISC processor family, have incorporated
support for on-chip duplex operations.
Each device includes comparators at each
output pin and a master/checker
operating mode.

One chip in each pair is designated the
master and drives all outputs normally.
The second chip, designated the checker,
disables its output drivers and samples
the outputs that the master chip
supplies. The on-chip comparators within
the checker detect any disagreements
between the two chips and provide the
error signal (see figure 1c).

Process outputs can also be compared 
in software, allowing the two standard 
modules to operate in a loosely coupled 
duplex configuration. Typically, the two 
processes would exchange all critical in- 
formation and compare the two copies in 
software prior to using that information. 

Triple-modular redundancy (TMR). Duplex
configurations detect errors without
identifying which module is correct or
faulty. If you need continuous real-time
operations, you do not have the time to
stop the system to find out which unit
is correct. Continuous operation
requires that the system mask errors
instantaneously. Repair and
reconfiguration operations have to take
place either in parallel with normal
operations or later at a more convenient
time.

Figure 2: Alternative TMR
configurations. (a) In a loosely coupled
TMR configuration, software performs the
voting. (b) In a tightly coupled TMR
configuration, processors are each
assigned to drive one bus through their
bus-isolation logic and read all three
buses via their input voters.

TMR is the most common fault-masking
configuration. In it, three identical
modules perform each operation
concurrently (see figure 1d). A voter
selects the overall output to correspond
to the majority vote of the three
modules, thus masking failures in any
one of them. You can readily extend this
process to n-modular redundancy with n
identical modules and a corresponding
majority voter.
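
A software majority voter for n modules can be sketched as follows. The function name is an assumption, and the sketch requires outputs that can be counted (hashable values), which a hardware voter working bit by bit would not need.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by more than half of the modules."""
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError(f"no majority among outputs {outputs!r}")
    return value


# Three modules, one of which has failed: the fault is masked.
print(majority_vote([42, 42, 17]))
```

With three modules the voter masks one failure; two simultaneous failures that agree with each other would outvote the good module, which is the limit of this scheme.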

As in duplex operations, the vote in an 
n-modular redundant configuration can 
be performed in hardware, on a cycle-by- 
cycle basis, or in software, with modules 
exchanging data and using software vot- 
ing (see figure 2a). TMR is used in the 
Space Shuttle computer complex (which 
uses four processors) and in the experi- 
mental SIFT aerospace computer and its 
commercial counterpart, the August 
Systems industrial process-control com- 
puter (three processors). These systems 
exchange all critical data values and vote 
on them before they are used in any pro- 
gram step. 

While the TMR configuration in figure 1d
tolerates any processor failure, it is
still vulnerable to voter failure. If
the probability of voter failure is
significant, you can use three voters;
for example, figure 2b shows the
configuration of the Fault-Tolerant
Multiprocessor (FTMP) developed at
Charles Stark Draper Labs (Cambridge,
MA) for aerospace applications.

In FTMP, you can group any three 
processors into a TMR triad, with each 
processor driving one of the redundant 
buses, reading all three buses, and voting 
on their inputs; thus, the failure of any 
processor or any voter disables the entire 
processor/voter pair. You can assign any 
other processor to replace the failed unit 
within the affected triad simply by tell- 
ing it which bus to drive. 

Self-checking module pairs. One al- 
ternative to voting that can perform con- 
tinuous error-free operation is to use 
self-checking module pairs. The quad- 
ruplex configuration (see figure 3) has 
been used in the 68000-based Stratus 32 
systems (Stratus Computer, Marlbor- 
ough, MA). 



In the quadruplex configuration, two 
pairs of duplex modules (a total of four 
processors and two comparators) per- 
form all operations concurrently. Each 
duplex module is self-checking in that a 
comparator detects any disagreements 
between its two processors. If such an 
error occurs, the system disables that 
module's output, and while it replaces 
the faulty module, the remaining duplex 
module continues operating alone. 

Several techniques are used to design 
modules that are self-checking but not 
replicated. The Bell System's 3A Proces- 
sor (successor to the 1A) uses two pro- 
cessors that operate autonomously except 
for periodically exchanging state infor- 
mation. Coding and self-checking logic 
within each processor enable a faulty 
processor to identify itself; when that 
happens, the second processor takes 
over, providing continuous error-free 
system operation. 

Information redundancy. Applying 
coding techniques to redundant bits can 
make it easier to detect and correct errors 
within an information word. Error-de- 
tecting and -correcting codes are the 
most widely used form of fault tolerance, 
with applications ranging from aerospace 
and military systems to laptop personal 
computers. 

The main attraction of coding is that it 
can detect and correct errors with signif- 
icantly less redundancy than you find 
with replicated modules. However, most 
coding schemes apply only where infor- 
mation is not transformed, such as in 
information storage or retrieval (e.g., 
memory, disk, and tape) and in data 
transmission over buses or communica- 
tions channels. 

How well a coding scheme detects or 
corrects errors depends on how well it 
can sort out the valid code words. A 
given number of errors must not be able 
to transform one valid code word to an- 
other; it must turn the code word into a 
noncode word. With additional redun- 
dancy, the separation between the two 
can be wide enough to associate specific 
noncode words with specific code words. 
When this occurs, a limited number of 
errors can be corrected. 

The separation between two binary words,
referred to as the Hamming distance, is
defined as the number of bit positions
in which the two words differ. Suppose
two valid code words differ in a single
bit position (i.e., they have a
separation of one). An error in that
single bit position will transform one
valid code word into another, and the
error will be undetectable.
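
The distance definition can be checked directly in code. In this sketch, words are held as Python integers, and the two sample codes (3-bit even-parity words and a 3-bit repetition code) are chosen here for illustration.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def minimum_separation(code_words) -> int:
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))


even_parity_3bit = [0b000, 0b011, 0b101, 0b110]
repetition_3bit = [0b000, 0b111]

print(minimum_separation(even_parity_3bit))  # single errors are detectable
print(minimum_separation(repetition_3bit))   # single errors are correctable
```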

If the minimum separation is two, then
for any valid code word, a single error
can only produce a noncode word, and the
error can be found. However, if two
errors were to occur, one valid word
could be converted to another valid
word, and the errors would not be seen.

If the minimum separation increases to
three, however, each single error
produces a noncode word that can be
uniquely associated with its original
code word. When this occurs, the system
can produce the correct data during
decoding.

Figure 3: This configuration is used for
continuous error-free operation. A
detected error disables the output of a
faulty module, letting the other module
continue.

Simple parity checking uses a single 
redundant bit to provide a minimum sep- 
aration of two. Words with even parity 
have an even number of 1 bits; therefore, 
a single error produces a word with an 
odd number of 1 bits, which identifies it 
as a noncode word. 
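
Even parity can be sketched with two small helpers. Placing the parity bit in the low-order position is an arbitrary choice made for this example.

```python
def add_even_parity(data: int) -> int:
    """Append one parity bit so the whole word has an even number of 1 bits."""
    parity = bin(data).count("1") % 2
    return (data << 1) | parity


def has_even_parity(word: int) -> bool:
    """True if the word is a valid code word (even number of 1 bits)."""
    return bin(word).count("1") % 2 == 0


word = add_even_parity(0b1011)
print(has_even_parity(word))            # valid code word
print(has_even_parity(word ^ 0b00100))  # one bit flipped: detected
print(has_even_parity(word ^ 0b00110))  # two bits flipped: not detected
```

The last line shows the limit of a separation of two: a double error turns one valid code word into another.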

Hamming codes (often used to protect 
memory systems) compute multiple pari- 
ty bits for overlapping subsets of the bits 
within each data word. For a single 
error-correcting code, the overlap pro- 
vides a minimum separation of three, en- 
abling error correction. 
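
As one concrete instance, the sketch below uses the classic (7,4) Hamming code: four data bits protected by three parity bits over overlapping subsets. The bit ordering and function names are choices made for this example; memory systems typically use wider codes.

```python
def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] as a 7-bit word, positions 1 through 7."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Correct up to one flipped bit; return (data_bits, error_position)."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    # The recomputed parities, read as a binary number, give the
    # 1-based position of a single flipped bit (0 means no error).
    position = s1 + 2 * s2 + 4 * s4
    if position:
        w[position - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                      # a fault flips the bit at position 5
print(hamming74_decode(stored))
```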

Figure 4 illustrates a memory system 
utilizing an error-detection and -correc- 
tion circuit. Check bits are computed and 
stored with the data during each memory 
write, and they are rechecked during 
each memory read; the data is corrected 
if errors are found. 

Cyclic-redundancy-check (CRC) codes
commonly protect devices and
communications channels that use serial
data transfers. In CRCs, linear-feedback
shift registers compute a set of check
bits over an entire string of data and
then store or transmit the check bits
after the data. The same operation is
performed when retrieving or receiving
the information; the computed check bits
are compared to the original check bits
to detect errors. Some CRC codes, such
as those used on some high-performance
disk drives, also provide sufficient
redundancy to correct a limited number
of bit errors.
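
The shift-register computation can be imitated in software one bit at a time. The sketch below assumes an 8-bit CRC with generator polynomial x^8 + x^2 + x + 1 (0x07) and a zero starting value; real devices use a variety of polynomials and widths.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, most significant bit first, initial value zero."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift left one bit; when a 1 falls off the top, fold the
            # generator polynomial back in, as the feedback taps would.
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


message = b"fault tolerance"
check = crc8(message)

received = bytearray(message)
received[3] ^= 0x10                      # one bit corrupted in transit
print(crc8(bytes(received)) == check)    # mismatch reveals the error
```

For this particular variant, running the same computation over the data followed by its check byte yields zero, which gives the receiver a simple test.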

Other Error-Detection Mechanisms 

The state of a digital system in the clock 
period following the current one is a 
function of only its current state and the 
system inputs. In any particular state, the 
number of "next" states and inputs that 
can occur is relatively small. Hence, 
special hardware or software can often 
detect an improper input or an incorrect 
next state. 

Several computer networks and sys- 
tem buses, especially those used in mili- 
tary and aerospace systems, are designed 
to ensure proper protocols during data 
transfers. They detect out-of-sequence or 
late-arriving events as system errors. 

Typically, the limits on how long it can
take to perform a particular event are
known. Special watchdog timers can
determine when an event fails to occur
within its time frame and signal that
problems exist. A wide variety of
systems include time-out checks as an
inexpensive way of detecting system
failures, since such failures typically
prevent an event from completing within
its given time limit.
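
A watchdog timer reduces to remembering when the monitored event last occurred. The class below is a minimal sketch; the names are assumptions, and the injectable `clock` parameter exists only so the example can run without real delays.

```python
import time


class Watchdog:
    """Flags a problem if kick() is not called within timeout_s seconds."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        # Called by the monitored task each time the event completes.
        self._last_kick = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_kick > self.timeout_s


# Simulated clock so the demonstration is deterministic.
now = [0.0]
dog = Watchdog(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
dog.kick()                 # the event completed on time
now[0] = 2.0               # the next event never arrives
print(dog.expired())
```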

On several computers, operating-sys- 
tem software implements other protocol 
checks to ensure that the application pro- 
grams follow proper procedures. In ad- 
dition, most computers use special hard- 
ware to detect such errors as divide by 
zero, improper memory access, and non- 
existent op codes. Most of these devices 
are relatively inexpensive to implement, 
and they often supplement other error- 
detection mechanisms in a fault-tolerant 
system. 

The Three Rs 

For continuous system operation, a
fault-tolerance strategy must include
the three Rs: repair, reconfiguration,
and recovery. You must render a faulty
component unable to affect other system
elements by manually or electrically
removing it from the system or by
routing all information around it to
effectively isolate it. After isolation,
you can either replace the faulty unit
with a spare and restore the system to
full strength or continue to operate the
system, but with fewer resources.

Operating with fewer resources is re- 
ferred to as graceful degradation and is 
especially popular in multiprocessor sys- 
tems. In most multiprocessors, a number 
of processing elements share the work- 
load by distributing the tasks among 
themselves. If one processing element 
fails, then its tasks are redistributed to 
allow operations to continue; however, 
fewer processors performing the same 
amount of work will reduce overall per- 
formance. 

If performance degradation is not
acceptable for a given application, you
must provide some number of spare
modules. Synapse Computer (Milpitas, CA)
marketed a multiprocessor called the N+1
system; it provided n+1 processing
elements for applications requiring n
processors to achieve the desired
performance.

The N+1 system stored all tasks in a
queue in a common memory area. Each
processor continuously selected a new
task from the queue after completing its
current task. If any single processor
failed, it was disabled and its task
reentered on the queue. The remaining n
processors continued to select tasks
from the queue, ensuring continued
correct operation.
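
The shared-queue scheme can be simulated in a few lines. This is a sketch of the idea only, not of the Synapse design; the function name, the round-robin scheduling, and the `fails_after` fault-injection parameter are all invented for the example.

```python
from collections import deque


def run_with_spare(tasks, processor_ids, fails_after=None):
    """Simulate processors pulling tasks from one shared queue.

    fails_after maps a processor id to the number of tasks it completes
    before failing (a hypothetical fault-injection hook).
    """
    fails_after = fails_after or {}
    queue = deque(tasks)
    active = list(processor_ids)
    completed = {pid: [] for pid in processor_ids}
    while queue:
        if not active:
            raise RuntimeError("no working processors remain")
        for pid in list(active):
            if not queue:
                break
            task = queue.popleft()
            if fails_after.get(pid) == len(completed[pid]):
                active.remove(pid)   # disable the failed processor
                queue.append(task)   # put its interrupted task back
                continue
            completed[pid].append(task)
    return completed


result = run_with_spare(range(10), ["p0", "p1", "p2"], fails_after={"p1": 2})
print(result)
```

Because no task is bound to a particular processor, losing one unit changes only how long the queue takes to drain, not whether every task completes.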

Once a system has been restored to full 
strength or reconfigured to isolate a 
faulty unit, its operational state must be 
set to a correct value. The extent of the 
recovery depends on the extent of the 
error propagation. The most common ap- 
proach is to restore, or roll back, the state 
of the system to a known good value. 

Current Trends 

With the ability to put more devices onto 
a single chip, VLSI designers are incor- 
porating on-chip fault-detection mecha- 
nisms to improve testability for initial 
part checkout, to support diagnostic op- 
erations, and to provide on-line error de- 
tection during normal operations. 

Additional on-chip features to support
fault-tolerant system design are also
beginning to appear, such as comparators
at output pins to support duplex
operations. Many current memory chips
include on-chip logic to reconfigure the
rows and columns of a memory array to
isolate faulty storage cells (see "Chips
That Work" on page 187). This is
normally done at initial testing time to
increase yield by eliminating
manufacturing defects. In some cases,
faults occurring during normal operation
can be tolerated in the same manner.

While fault-tolerant design principles
were once limited to special-purpose
systems that had to be highly
dependable, their use is beginning to
extend to general-purpose minicomputers
and mainframes, as can be seen in
current offerings from IBM and DEC. This
trend will also continue into the
personal computer arena, making fault
tolerance an integral and cost-effective
part of all computer systems.